rBahadur: efficient simulation of structured high-dimensional genotype data with applications to assortative mating

Existing methods for generating synthetic genotype data are ill-suited for replicating the effects of assortative mating (AM). We propose rb_dplr, a novel and computationally efficient algorithm for generating high-dimensional binary random variates that effectively recapitulates AM-induced genetic architectures using the Bahadur order-2 approximation of the multivariate Bernoulli distribution. The rBahadur R library is available through the Comprehensive R Archive Network at https://CRAN.R-project.org/package=rBahadur. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05442-6.


S1 Methods for sampling MVB random variates
We consider the problem of drawing the realization of m random variables X 1 , X 2 , . . . , X m from the multivariate Bernoulli (MVB) distribution. For m such random variables, there are 2 m possible outcomes. Following Teugels, [1], for k ∈ where each k n ∈ {0, 1}. Let p (m) ∈ R 2 m be a vector of probabilities defined as p (m) k def = P{X 1 = k 1 , X 2 = k 2 , . . . , X m = k m }.
Let σ (m) ∈ R 2 m be a vector of (central) moments, defined elementwise via where µ n def = E[X n ] is the nth marginal probability. We assume each µ n ∈ (0, 1) as otherwise the corresponding random variable is simply constant. Note that σ (m) 1 = 1, and σ (m) k = 0 for all k with binary expansion (k 1 , . . . , k m ) such that n k n = 1 (i.e., all first order central moments are zero). For an index k ∈ [2 n ] with binary expansion (k 1 , . . . , k n ), we use will use the notations σ The second order moments in σ (m) are precisely those on positions k = (k 1 , . . . , k m ) with n k n = 2. Table 1 shows an example of σ (m) in the case m = 4, where σ ij denotes the second order central moment involving X i and X j .
For a vector x ∈ R m and integers a and b satisfying 1 ≤ a < b ≤ m, let x a:b denote the subvector . . .
Notice that σ (4) 1:8 = σ (3) . This is true in general: σ We will useγ (n) ∈ R n to denote the vector containing the n potentially nonzero elements in γ (n) . More precisely,γ For each n ∈ [N ], let B (n) ∈ R 2×2 be defined as From Theorem 1 in Teugels [1], we have where ⊗ denotes the matrix Kronecker product.
Our goal is to sample X 1 , . . . , X m iteratively: We first sample X 1 , then X 2 given the outcome of X 1 , then X 3 given the outcome of X 1 and X 2 , and so forth. By leveraging Proposition S1.3, we can construct an algorithm which efficiently samples m random variables from the MVB. Notice that computing an element v (n) i via (7) costs O(n), which would bring the total cost of computing all elements of v (n) to O(n 2 ). Since a different v (n) is needed for each conditional probability in (8), such an approach would bring the total cost of the sampling algorithm to O(m 3 ). A key insight to making the sampling efficient is that v (n) can be computed from v (n−1) via an update which only costs O(n). For n ∈ [m], assume that p k1···kn > 0 and define the constants Proposition S1.4. For n ∈ {2, 3, . . . , m}, assuming p k1···kn > 0, v (n) and c (n) satisfy the recursions and Proof. Since p k1···kn > 0, this implies that p kn|k1···kn−1 > 0, which means the divisions in (10) and (11) are well-defined. Equation (10) follows from the definition of v (n) in (7) and the fact that p kn|k1···kn−1 = p k1···kn /p k1···kn−1 . Equation (11) follows immediately from the definition of c (n) in (9).
Suppose the second order moments are stored in a matrix C ∈ R m×m : Notice that the vectorsγ (n) = C 1:n,n+1 can be read directly from C.
In Algorithm S1, we give a high-level algorithm for our sampling approach. Both of our proposed methods rb_unstr and rb_dplr follow this high-level algorithm, but differ on how they perform the substeps. The steps that differ between the algorithms are the initialization on line 4, the computation of the conditional probability on line 6, and the update on line 10. In Algorithm S2 we specify how these steps are done in rb_unstr for a generic second moment matrix with the third and higher moments equal to zero. Notice that the division on line 4a in Algorithm S2 is well-defined, since p k1 must be positive. Similarly, the computations on lines 10a and 10b in Algorithm S2 involve divisions by p kn|k1···kn−1 which is also guaranteed to be positive. The correctness of line 6a in Algorithm S2 follows from Proposition S1.3.

S1.1 Efficient algorithm for diagonal plus low-rank second moments
We now consider the case when C = D + UU , where D ∈ R m×m is diagonal and U ∈ R m×r for r < m.
As discussed in Remark S1.5, the main cost in Algorithm S2 are on lines 6a and 10a. When C is diagonalplus-low-rank, there is a clear relationship between subsequent vectorsγ (n) . This can be leveraged to avoid computing the inner product on line 6a. Moreover, the step on line 10a can also be avoided: Instead of computing v (n) , we can instead keep track of r inner products that are fast to update from one step of the for loop to another.
Let u ( ) be the th column of U. Then Algorithm S2: Generate Bahadur order-2 MVB variates with arbitrary correlation structures (rb_unstr) /* Same as Algorithm S1 with lines 4, 6 and 10 defined as below */ 4 Initialize parameters: 10 Update parameters:

10b
Compute c (n) according to (11) It follows thatγ The inner product v (n) γ (n) can be written where each x :n is an inner product. As Proposition S1.6 shows, these r numbers can be updated efficiently.
Algorithm S3 shows how rb_dplr is implemented. It uses Proposition S1.6 to reduce the complexity of Algorithm S2 in the case when C is diagonal-plus-low-rank. Notice that the lines 4a and 10a of Algorithm S3 involve divisions by p k1 and p kn|k1···kn−1 , respectively, which are both guaranteed to be positive, so these computations are well-defined.
Algorithm S3: Generate Bahadur order-2 MVB variates with diagonal-plus-low-rank (DPLR) correlation structures (rb_dplr) /* Same as Algorithm S1 with lines 4, 6 and 10 defined as below */ 4 Initialize parameters:  O(mr). For a general U this is optimal asymptotically, since there are up to mr nonzero elements in U that need to be considered.
Remark S1.8 (Exchangeable correlation). With exchangeable correlation structure, all off-diagonal elements of C are the same, i.e., C ij = α when i = j. This is a diagonal-plus-rank-1 structure: With u ∈ R m such that all entries equal to √ α, C = D + uu for some diagonal matrix D. Algorithm S3 therefore handles this case with a complexity of O(m).

S2 Simulating equilibrium AM with simple local LD
Here we demonstrate using rBahadur to simulate genotype / phenotype data at equilibrium under AM using am_simulate, which takes the following parameters: • h2_0 : panmictic heritability

S3 Simulating equilibrium AM with complex local LD
The greatest limitation of using the proposed methods to simulate genotype data is that the desired first and second moments must result in a valid Bahadur order-2 MVB distribution. Specifically, from Proposition 3 of [2], we must have that the minimal eigenvalue of the target correlation matrix R satisfies where σ j = µ j (1 − µ j ). In practice, this makes it difficult to accommodate strong local LD among nearby loci. To address this limitation, we propose the two step procedure outlined in Algorithm S4. Specifically, we first generate causal variants using rb_dplr and empirical allele frequencies at m of p m SNPs across the genome. We then determine a partition of the genome such that each slice contains one and only one causal locus, using a recombination map to select boundaries. For each resulting haplotype block, we then choose haplotypes from a real or synthetic panmictic founder population from among those which agree with the simulated variant at the casual locus. Proceeding in this manner, we are able to benefit from the computational efficiency of rb_dplr while generating genotype data with realistic local LD.
As a proof of concept, we implement this approach for biallelic single nucleotide variants on chromosome 22 in the following vignette, using phase three data provided by the 1000 Genomes Project [3].
We first load and parse the VCF data, restricting to 173,545 common loci with minor allele frequency greater than 0.01: library(rBahadur) library(data. For simplicity, we draw haplotype block boundaries uniformly at random between causal variants, though a genetic map could also be used: Finally, we construct indices corresponding to haplotype blocks in the 1000 genomes reference data matching our synthetic data at causal loci and use these to reconstruct the full chromosome: ## sum to obtain diploid haplotypes sim_diploid <-sim_maternal + sim_paternal These haplotypes are equal to our MVB replicates at each causal locus while preserving local LD. Comparing allele frequencies at all loci between the 1000 genomes reference data and the synthetic genotypes yields F st = 3.5e-4.